 great expectation


Best prompts to get the most out of an AI chatbot

FOX News

Many people are turning to chatbots as search engines. Kurt "The CyberGuy" Knutsson explains how to use them. People love using AI chatbots to assist them with tasks or to simply answer a question that they don't know the answer to. However, a chatbot can only answer to the best of its ability, and we have to also do our part to help it answer our questions as accurately as possible. If you don't get specific enough with how you present your prompts, a chatbot could give you a generic or even incorrect answer.


How to Test PySpark ETL Data Pipeline

#artificialintelligence

"Garbage in, garbage out" is a common expression used to emphasize the importance of data quality for tasks such as machine learning, data analytics and business intelligence. With an increasing amount of data being created and stored, building high-quality data pipelines has never been more challenging. PySpark is a commonly used tool for building ETL pipelines for large datasets. A common question that arises while building a data pipeline is "How do we know that our data pipeline is transforming the data in the way that is intended?". To answer this question, we borrow the idea of unit testing from the software development paradigm.
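
The unit-testing idea can be illustrated with a minimal sketch. It uses plain Python rows (a list of dicts) as a stand-in for a PySpark DataFrame, so it is not the article's own code; the transformation `add_total_price`, its column names, and the helper `run_tests` are hypothetical and chosen only for illustration. The pattern is the same in either setting: build a small known input, run the transformation, and compare the result with a hand-written expected output.

```python
import unittest


def add_total_price(rows):
    """Hypothetical transformation: add a 'total' column = quantity * unit_price."""
    return [
        {**row, "total": row["quantity"] * row["unit_price"]}
        for row in rows
    ]


class AddTotalPriceTest(unittest.TestCase):
    def test_total_is_quantity_times_unit_price(self):
        # Small, hand-checked input (assumed schema, for illustration only).
        source = [
            {"quantity": 2, "unit_price": 5},
            {"quantity": 0, "unit_price": 7},
        ]
        expected = [
            {"quantity": 2, "unit_price": 5, "total": 10},
            {"quantity": 0, "unit_price": 7, "total": 0},
        ]
        self.assertEqual(add_total_price(source), expected)


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTotalPriceTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests().wasSuccessful()` reports whether the transformation behaved as intended on the sample input.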


Hopsworks 3.0: The Python-Centric Feature Store

#artificialintelligence

Feature stores began in the world of Big Data, with Spark being the feature engineering platform for Michelangelo (the first feature store) and Hopsworks (the first open-source feature store). Nowadays, the modern data stack has assumed the role of Spark for feature stores - feature engineering code can be written that seamlessly scales to large data volumes in Snowflake, BigQuery, or Redshift. However, Python developers know that feature engineering is so much more than the aggregations and data validation you can do in SQL and DBT. Dimensionality reduction, whether using PCA or Embeddings, and transformations are fundamental steps in feature engineering that are not available in SQL, even with UDFs (user-defined functions), today. Over the last few years, we have had an increasing number of customers who prefer working with Python for feature engineering.


Data Quality

#artificialintelligence

You can bet that you will be asked what kind of data issues you might encounter in your day job during one of your data engineer or data scientist interviews. Data quality will do more for model performance than any other technique. You could train a complicated deep learning model on massive amounts of data, but if the underlying data is bad, so too will be the model's inferences. In this article, we will attempt to address common data quality issues. As mentioned in the Kaggle tutorial on handling missing values, we need to distinguish between values that are missing because they were not recorded and values that are missing because they don't exist.
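
The distinction between the two kinds of missing values can be sketched as follows. This is an illustrative example, not taken from the article or the Kaggle tutorial: the records, the column names, and the choice to fill unrecorded values with the median are all assumptions. A height that was never measured is "not recorded" and can reasonably be estimated, while the spouse's age of an unmarried person "doesn't exist" and is left empty.

```python
from statistics import median


def impute_not_recorded(rows, column):
    """Fill None in `column` with the median of the observed values.

    Only appropriate for values that exist but were not recorded.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical records, for illustration only.
people = [
    {"height_cm": 170, "married": True, "spouse_age": 41},
    {"height_cm": None, "married": False, "spouse_age": None},
    {"height_cm": 180, "married": True, "spouse_age": 35},
]

# height_cm was simply not measured, so estimating it is reasonable.
# spouse_age does not exist for an unmarried person, so it stays None.
cleaned = impute_not_recorded(people, "height_cm")
```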


Building an Open Source ML Pipeline: Part 2

#artificialintelligence

In order to set up event-driven workflows we need to add another tool to our toolkit, namely Argo Events. Setting up Argo Events can be a bit tricky, but I provide the necessary YAML files to do so on GitHub here. We will start off with an example from their examples just to make sure that everything is installed correctly. So once you have cloned the repository feel free to take a look at the contents of these manifests. As far as modifications go, you need to replace the base64 encoded credentials in 'minio-secret.yaml'


Reducing Pipeline Debt With Great Expectations

#artificialintelligence

This article was first published on Neptune AI's blog. You are a part of a data science team at a product company. Your team has a number of machine learning models in place. Their outputs guide critical business decisions, as well as a couple of dashboards displaying important KPIs that are closely watched by your executives day and night. On that fatal day, you had just brewed yourself a cup of coffee and were about to begin your workday when the universe collapsed. Everyone at the company went crazy. The business metrics dashboard was displaying what seemed to be random numbers (except every full hour, when the KPIs looked okay for a short time) and the models were predicting the company's insolvency looming fast. What is worse, every attempt to resolve this madness resulted in your data engineering and research teams reporting new broken services and models. That was the debt collection day and the unpaid debt was of the worst kind: pipeline debt.


Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories

arXiv.org Artificial Intelligence

Measuring event salience is essential in the understanding of stories. This paper takes a recent unsupervised method for salience detection derived from Barthes Cardinal Functions and theories of surprise and applies it to longer narrative forms. We improve the standard transformer language model by incorporating an external knowledgebase (derived from Retrieval Augmented Generation) and adding a memory mechanism to enhance performance on longer works. We use a novel approach to derive salience annotation using chapter-aligned summaries from the Shmoop corpus for classic literary works. Our evaluation against this data demonstrates that our salience detection model improves performance over and above a non-knowledgebase and memory augmented language model, both of which are crucial to this improvement.


Amazon and the Burden of Great Expectations

WSJ.com: WSJD - Technology

It proved to be a near-fatal mistake for their small business, which is the family's sole source of income, and employs all four of the couple's teenage and adult children, as well as Mrs. Wilsondebriano's sister and mother. In the Amazon era, selling online is one thing, but actually getting products to customers fast enough to make them happy is something else. It's especially difficult if, like the Wilsondebrianos, a merchant isn't selling via Amazon, but still feels obligated to match the e-commerce giant's promises of free and fast delivery. Their sales plummeted, from upward of $20,000 a month to as little as $3,000 a month, Mrs. Wilsondebriano says. The family had no choice but to pack and ship orders themselves, since they could no longer afford the third-party shipper they had been using.


Why data quality is key to successful ML Ops

#artificialintelligence

Machine learning has been, and will continue to be, one of the biggest topics in data for the foreseeable future. And while we in the data community are all still riding the high of discovering and tuning predictive algorithms that can tell us whether a picture shows a dog or a blueberry muffin, we're also beginning to realize that ML isn't just a magic wand you can wave at a pile of data to quickly get insightful, reliable results. Instead, we are starting to treat ML like other software engineering disciplines that require processes and tooling to ensure seamless workflows and reliable outputs. "Poor data quality is Enemy #1 to the widespread, profitable use of machine learning, and for this reason, the growth of machine learning increases the importance of data cleansing and preparation. The quality demands of machine learning are steep, and bad data can backfire twice -- first when training predictive models and second in the new data used by that model to inform future decisions."


Great Expectations: Big Data and Laplace

#artificialintelligence

Scientific determinism as first published by Laplace in 1814 is an important and essential principle in the macro-world around us. We know that if we push something, it will move -- unless our impulse was not sufficient to overcome inertia… and so forth. Laplace postulated that if there were an omniscient daemon who knew the precise positions and impulses of each and every particle in a system, this daemon would be able to deterministically calculate each and every future state of this system. Our beloved spreadsheet calculations resemble this daemon (possibly in more than one connotation). Typing in some basic data to start calculations from, the wondrous spreadsheet software will automagically calculate everything depending on them, eventually deriving the results we wanted to obtain.